Background
Load and Process Files
The summary looks for input files in the /Library/Frameworks/R.framework/Versions/4.1/Resources/library/MappgeneSummary/test_files/ directory and writes the following files there. Creating the .Rds files is typically the slowest part of this script, so they are saved to speed up testing and creation of the HTML summary.
| File Name | File Type | Notes |
|---|---|---|
| _summary_lofreq.Rds | R data | The lofreq data (tibble) |
| _summary_ivar.Rds | R data | The ivar data (tibble) |
| _summary_DF.Rds | R data | The “final” data used for all analyses (tibble) |
| _summary_DF.txt | tab-delimited | The “final” data used for all analyses |
| _lofreq_top_outbreak_results.txt | tab-delimited | The Outbreak API results from the top lofreq iSNVs |
| _ivar_top_outbreak_results.txt | tab-delimited | The Outbreak API results from the top ivar iSNVs |
| _amplicon_summary.txt | tab-delimited | Attempt at re-creating a specific report for Excel analysis |
| _amplicon_results.txt | tab-delimited | Attempt at re-creating a specific report for Excel analysis |
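The caching idea behind the saved .Rds files can be sketched as follows. This is a minimal illustration in Python, not the report's own R code: `load_or_build` is a hypothetical helper, and `pickle` stands in for R's serialized .Rds format.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if cache_path exists; otherwise build it and cache it.

    Hypothetical helper: `build` is any zero-argument function that performs
    the slow step (for example, parsing all pipeline output files).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result saved by an earlier run.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    # Slow path: compute once, then save for later runs.
    result = build()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

With this pattern, deleting the cached file forces the slow step to run again on the next call.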
Lofreq (iSNVs <1%)
This is the “iVar -> LoFreq -> snpEFF/snpSIFT” workflow, and these results will have the “lofreq” PIPELINE string.
The ALT_DP column has been added as as.integer(DP * AF).
The HGVS_P column is also converted to REF_AA and ALT_AA, and any synonymous mutations are removed.
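The two column operations above can be sketched roughly as follows, in Python for illustration (the report itself is generated in R). The sketch assumes HGVS_P values for simple substitutions look like `p.Asp614Gly` or `p.Gln27*`; the regular expression and the function names are hypothetical, and anything that does not match that shape is left unparsed rather than guessed at.

```python
import re

# Assumed shape of a simple protein substitution, e.g. "p.Asp614Gly" or "p.Gln27*".
SIMPLE_PROTEIN_CHANGE = re.compile(r"^p\.([A-Za-z]{3})(\d+)([A-Za-z]{3}|\*)$")


def alt_depth(dp, af):
    """Truncate DP * AF toward zero, mirroring as.integer(DP * AF)."""
    return int(dp * af)


def split_hgvs_p(hgvs_p):
    """Return (REF_AA, position, ALT_AA), or None if not a simple substitution."""
    match = SIMPLE_PROTEIN_CHANGE.match(hgvs_p)
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa


def is_synonymous(hgvs_p):
    """True only when the parsed reference and alternate amino acids match."""
    parsed = split_hgvs_p(hgvs_p)
    return parsed is not None and parsed[0] == parsed[2]
```

Rows for which `is_synonymous` returns true would be the ones dropped; how the report treats values that do not parse (for example, deletions) is not stated here.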
iVar (iSNVs > 1%)
This is the “iVar -> snpEFF/snpSIFT” workflow, and these results will have the “ivar” PIPELINE string.
iVar output was processed according to Jimmy’s pipeline:
- removing variant positions with an ALT_QUAL of 20 (these are almost entirely single-nt insertions),
- removing positions that did not pass the Fisher test (PASS=FALSE),
- removing synonymous mutations (REF_AA=ALT_AA).
The ALT_DP column has been added as as.integer(DP * AF).
The HGVS_P column is also converted to REF_AA and ALT_AA, and any synonymous mutations are removed.
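The three iVar filters can be sketched as a single predicate, in Python for illustration. Two assumptions are made loudly here: records are plain dictionaries keyed by the column names used in this report, and the ALT_QUAL filter keeps only values strictly above 20, following the “ALT_QUAL > 20” wording in the Effects section. The function name and the `min_alt_qual` parameter are hypothetical.

```python
def passes_ivar_filters(record, min_alt_qual=20):
    """Return True if an iVar record survives all three filters.

    Assumption: the quality filter keeps ALT_QUAL > min_alt_qual.
    """
    return (
        record["ALT_QUAL"] > min_alt_qual            # drop low-quality ALT calls
        and record["PASS"]                           # drop Fisher test failures
        and record["REF_AA"] != record["ALT_AA"]     # drop synonymous mutations
    )


def filter_ivar(records, min_alt_qual=20):
    """Keep only the records that pass every filter, preserving order."""
    return [r for r in records if passes_ivar_filters(r, min_alt_qual)]
```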
Pipeline Summary and Merging
SAMPLEs: There are 3 total samples with data so far, 0 done with just lofreq, 0 done with just ivar, and 3 with both.
SNVs: There are 244 total SNVs identified so far that are below 1% in ‘lofreq’ in at least one sample, or above 1% in ‘ivar’ in at least one sample, 213 found with just lofreq, 26 found with just ivar, and 5 found with both.
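The counts above are set arithmetic on SNV identifiers: 213 + 26 + 5 = 244. A minimal Python sketch of that bookkeeping, with a hypothetical function name and inputs assumed to be collections of SNV identifiers per pipeline:

```python
def pipeline_overlap(lofreq_ids, ivar_ids):
    """Count identifiers unique to each pipeline and shared by both."""
    lofreq_ids, ivar_ids = set(lofreq_ids), set(ivar_ids)
    return {
        "total": len(lofreq_ids | ivar_ids),        # union of both pipelines
        "lofreq_only": len(lofreq_ids - ivar_ids),  # found with just lofreq
        "ivar_only": len(ivar_ids - lofreq_ids),    # found with just ivar
        "both": len(lofreq_ids & ivar_ids),         # found with both
    }
```

The same function applies to sample names, where the three partial counts again sum to the total.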
Effects
snpSIFT classifies the SNV effect into several categories. Here are the categories and their total frequencies in all samples. Only the “blue” categories are used in this analysis. Most of the “ivar” effect categories are removed with the ALT_QUAL > 20 filtering.
| EFFECT | SO | Description |
|---|---|---|
| missense_variant | SO:0001583 | A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved. |
| stop_gained | SO:0001587 | A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened polypeptide |
| conservative_inframe_deletion | SO:0001825 | An inframe decrease in cds length that deletes one or more entire codons from the coding sequence but does not change any remaining codons |
| disruptive_inframe_deletion | SO:0001826 | An inframe decrease in cds length that deletes bases from the coding sequence starting within an existing codon |
| conservative_inframe_insertion | SO:0001823 | An inframe increase in cds length that inserts one or more codons into the coding sequence between existing codons |
| disruptive_inframe_insertion | SO:0001824 | An inframe increase in cds length that inserts one or more codons into the coding sequence within an existing codon. |
ALT_AAs and EFFECT
SNV Sample Count per Pipeline
Each SNV was given a unique identifier made of the gene name, the chromosome position, and the REF and ALT nucleotides, all separated by a period (.). So, S.23367.C.A is in the S gene, at chromosome position 23,367, and results in the reference “C” being converted to an “A”.
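The identifier scheme can be illustrated with a small Python sketch. The function names are hypothetical; parsing splits from the right so that a gene name containing a period would still be recovered intact, which is an assumption beyond what the report states.

```python
def make_snv_id(gene, position, ref, alt):
    """Join gene, position, REF, and ALT with periods."""
    return ".".join([gene, str(position), ref, alt])


def parse_snv_id(snv_id):
    """Split an identifier back into (gene, position, REF, ALT)."""
    # Split from the right: the last three fields never contain a period.
    gene, position, ref, alt = snv_id.rsplit(".", 3)
    return gene, int(position), ref, alt
```

For example, `make_snv_id("S", 23367, "C", "A")` gives the `S.23367.C.A` identifier described above.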
Below is the Sample count of all unique SNVs in both the lofreq and ivar pipelines.
Visualization
These are examples of the different plots that can be generated. They are typically static images, although some interaction could potentially be added later.
Heatmap
NMDS
Between Pipelines
The stress of the above plot is 0.0054846 (values above roughly 0.10-0.15, i.e. 10-15%, are generally considered “high stress”).
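For orientation, one common definition of NMDS stress is Kruskal's stress-1, sketched below in Python. Whether the plots in this report use exactly this formula is an assumption; the function name is hypothetical. Inputs are the pairwise distances in the ordination and the corresponding fitted disparities, in the same order.

```python
import math


def kruskal_stress_1(ordination_distances, disparities):
    """Kruskal's stress-1: sqrt(sum((d - d_hat)^2) / sum(d^2)).

    Returned as a proportion (0.10 corresponds to 10%).
    """
    if len(ordination_distances) != len(disparities):
        raise ValueError("inputs must have the same length")
    residual = sum((d - d_hat) ** 2
                   for d, d_hat in zip(ordination_distances, disparities))
    scale = sum(d ** 2 for d in ordination_distances)
    return math.sqrt(residual / scale)
```

Under this definition a value of 0.0054846 corresponds to about 0.55%, well below the 10-15% range.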
Between Samples
If an SNV was found in a sample by both ivar and lofreq, the frequency is averaged.
The stress of the above plot is 0 (values above roughly 0.10-0.15, i.e. 10-15%, are generally considered “high stress”).
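The averaging rule for SNVs called by both pipelines can be sketched as follows, in Python for illustration. The input layout, a sequence of (sample, SNV identifier, frequency) rows with one row per pipeline call, and the function name are assumptions.

```python
from collections import defaultdict


def average_across_pipelines(rows):
    """Average the frequency of each (sample, SNV) pair over its pipeline calls.

    A pair seen in only one pipeline keeps its single frequency unchanged.
    """
    grouped = defaultdict(list)
    for sample, snv_id, frequency in rows:
        grouped[(sample, snv_id)].append(frequency)
    return {key: sum(values) / len(values) for key, values in grouped.items()}
```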
Summary Tables
Amplicon Summary
This table is too big to display here. It is saved as a tab-delimited file here: ./_amplicon_summary.txt.
Amplicon Result
This table is too big to display here. It is saved as a tab-delimited file here: ./_amplicon_result.txt.
outbreak.info API
Note: only mutations with the ‘missense_variant’ EFFECT are searchable in the API, so all others are omitted below.
S Gene
S:G566V not found in CA
S:N658H not found in CA
Session Info
| name | value |
|---|---|
| version | R version 4.1.0 (2021-05-18) |
| os | macOS Catalina 10.15.7 |
| system | x86_64, darwin17.0 |
| ui | X11 |
| language | (EN) |
| collate | en_US.UTF-8 |
| ctype | en_US.UTF-8 |
| tz | America/Los_Angeles |
| date | 2021-11-04 |